Abstract

The large amount of data on galaxies, up to higher and higher redshifts,
calls for sophisticated statistical approaches to build adequate
classifications. Multivariate cluster analyses, which compare objects for
their global similarities, are still little used in astrophysics, probably
because their results are somewhat difficult to interpret. We believe
that the missing key is an unavoidable characteristic of our Universe:
evolution. Our approach, known as Astrocladistics, is based on the
evolutionary nature of both galaxies and their properties. It gathers
objects according to their "histories" and establishes an evolutionary
scenario among groups of objects. In this presentation, I show two
recent results on globular clusters and early-type galaxies to illustrate
how the evolutionary concepts of Astrocladistics can also be useful for
multivariate analyses such as K-means Cluster Analysis.



1 Introduction 

We are now able to study galaxies in great detail, identifying individual 
stars, gas and dust clouds, as well as different stellar populations. Imagery 
brings very fine structural details, and spectroscopy provides the kinemat- 
ical, physical and chemical conditions of the observed entities at different 
locations within a galaxy. For more distant objects, information is scarcer, 
but deep systematic sky surveys gather spectra for millions of galaxies at 
various redshifts. The amount of data on galaxies, their number, their diver- 
sity, their complexity and that of their evolution, suggest that they should 
be envisaged as a population or an ensemble of populations. This implies 
the use of the appropriate statistical tools. 

Like paleontologists, we observe objects from the distant past (galaxies at
high redshift), and like evolutionary biologists, we want to understand their
relationships with nearby galaxies, like our own Milky Way. Consequently,
"galactogenesis" can be advantageously approached with phylogenetic
methods.



2 Why be multivariate? 

The description of a given galaxy requires many observables, most of them 
derived from a spectrum. Usual classifications, often inspired by the Hubble 
tuning fork, use only a very few properties. Even if bivariate plots or cor- 
relations are clear and useful, they are incomplete. Worse, they are merely 
the projection onto a 2-D diagram of a multivariate parameter space. This 
projection is generally expected to increase the dispersion of the plot. Any- 
how, it is difficult to represent many data with only bivariate plots, and 
any classification necessarily requires an arbitrary binning of one or several 
parameters. 

Multivariate analyses are still not much used in astrophysics. One basic
tool, Principal Component Analysis, is relatively well known (e.g.
Cabanac et al., 2002; Recio-Blanco et al., 2006), but it is not a clustering
tool in itself. Very few attempts to apply multivariate clustering methods
have been made, and only very recently (Chattopadhyay and Chattopadhyay,
2007; Chattopadhyay et al., 2007, 2008, 2009a,b; Fraix-Burnet et al., 2006,
2009, 2010). Sophisticated statistical tools are used in some areas of
astrophysics and are developing steadily, but multivariate analysis and
clustering techniques have not much penetrated the community. It is true
that the interpretation of the results is not always easy.



3 Why be evolutive? 

Evolution, an unavoidable fact, is also not correctly taken into account in
most classification methods. By mixing together objects at different stages of
evolution, most of the physical significance and usefulness of a classification
is lost. In practice, the study of galaxy evolution is often limited to the
evolution of the properties of the entire population as a function of redshift
(Bell, 2005). Since the environment (the expanding Universe) and galaxy
properties are so closely intertwined, this kind of study is relevant to a
first approximation. However, recent observations have revealed that galaxies
of all kinds do not evolve perfectly in parallel, as illustrated for instance
by the so-called downsizing effect, which shows that large galaxies formed
their stars earlier than small ones (e.g. Neistein et al., 2006). New
observational instruments now bring multivariate information at different
stages of evolution, and in various evolutive environments. In this
multivariate context, we believe that the notion of "evolution", easy to
understand for a single parameter, is advantageously replaced by
"diversification".

The transformation of galaxies is a complex process (Fraix-Burnet et al.)
that cannot be disentangled with only a very few observables. For
instance, the elliptical shape of a galaxy can be obtained through the
monolithic collapse of a big cloud of gas, or through big mergers. To find
which process has shaped a given galaxy, many observables are required. Only
a multivariate and evolutive analysis can distinguish different histories.

4 Classification, complexity, evolution 

Multivariate clustering methods compare objects with a given measure and
then gather them according to a proximity criterion. There are two main
classes. Firstly, distance-based analyses rely on the overall similarity
derived from the values of the parameters describing the objects. The choice
of the most adequate distance measure for the data under study is not unique
and remains difficult to justify a priori. The way objects are subsequently
grouped together is also not uniquely defined. Secondly, character-based
methods compare objects in their evolutionary relationships (Wiley et al.,
1991). A character is a trait, a descriptor, an observable or a property
that can be given at least two states characterizing the evolutionary stages
of the object for that character. Here, the "distance" is an evolutionary
cost simply measured by the number of changes of the parameter values (or
character states). Groupings are then made on the basis of shared or
inherited characteristics, and are most conveniently represented on an
evolutionary tree.
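
As an illustration of an evolutionary cost counted in changes of character
states, the sketch below applies the classical Fitch small-parsimony rule
to a fixed rooted binary tree, treating the states as unordered. The object
names, the discretised states and the two candidate trees are hypothetical,
and this is not the procedure of any particular cladistics package: a real
analysis also searches over tree topologies and may treat states as ordered.

```python
def fitch_cost(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping each leaf name to its (unordered) character state
    """
    def visit(node):
        if isinstance(node, str):              # leaf: its state is observed
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:                             # children compatible: no new change
            return common, left_cost + right_cost
        # children incompatible: one additional change is required here
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def total_cost(tree, matrix):
    """Sum the Fitch cost over all characters (columns) of a state matrix."""
    n_characters = len(next(iter(matrix.values())))
    return sum(
        fitch_cost(tree, {name: row[i] for name, row in matrix.items()})
        for i in range(n_characters)
    )


# Hypothetical objects described by three characters with discretised states.
matrix = {
    "A": (0, 0, 1),
    "B": (0, 1, 1),
    "C": (1, 1, 0),
    "D": (1, 1, 0),
}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(total_cost(tree_1, matrix), total_cost(tree_2, matrix))  # prints: 3 5
```

With these toy data the first tree requires fewer changes than the second,
so it would be preferred under a parsimony criterion.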

Character-based methods like cladistics are better suited to the study
of complex objects in evolution, even though the relative evolutionary costs
of the different characters are not easy to assess. Distance-based methods
are generally faster and often produce comparable results, but the overall
similarity is not always adequate to compare evolving objects. In any
case, one has to choose a multivariate method, and the results are generally
somewhat different depending on this choice (Buchanan and Collard, 2008).
However, the main goal is to reveal a hidden structure in the data sample,
and the relevance of the method is mainly provided by the interpretation
and usefulness of the result.

We must note that taking all available parameters blindly can ruin the
multivariate and evolutive analysis. One dangerous component is a hidden
correlation, such as a size effect, that creates a redundancy. A lesser-known
caveat concerns spurious correlations between independent variables that
both vary as a function of a third, not necessarily obvious, parameter. This
is especially the case with time or the stage of evolution. Two quantities
can be totally unrelated, but if they both vary with time in a more or less
monotonic way, then they appear to be correlated. For instance, all
photometric quantities for galaxies are affected by stellar evolution. In
such a case, a cladistic analysis yields a regular tree showing the regular
stellar evolution (e.g. Fraix-Burnet, 2006).
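
The effect can be reproduced with a toy simulation. In the sketch below,
two synthetic quantities p and q both increase with a hypothetical
evolutionary stage t but are otherwise independent. All names and numerical
values are illustrative assumptions, not data from the analyses cited here.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)
stages = [0, 1, 2, 3, 4]                     # hypothetical evolutionary stages
samples = []                                 # (stage, p, q) triples
for t in stages:
    for _ in range(400):
        p = 1.0 * t + rng.gauss(0.0, 0.5)    # p increases with t
        q = 2.0 * t + rng.gauss(0.0, 0.5)    # q increases with t, independent noise
        samples.append((t, p, q))

# Correlation over the whole sample, all stages mixed together.
r_all = pearson([s[1] for s in samples], [s[2] for s in samples])

# Correlation within each stage, where t is held fixed.
r_by_stage = {
    t: pearson([s[1] for s in samples if s[0] == t],
               [s[2] for s in samples if s[0] == t])
    for t in stages
}

print(f"whole sample: r = {r_all:.2f}")
for t, r in r_by_stage.items():
    print(f"stage {t}: r = {r:.2f}")
```

The whole sample shows a strong correlation between p and q, while the
coefficient within each stage stays close to zero: the correlation is
carried entirely by the common dependence on t.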

Multivariate evolutionary classification in astrophysics has been
pioneered by the author (Fraix-Burnet et al., 2006a,b, 2009, 2010;
Fraix-Burnet, 2009). Called astrocladistics, it is based on cladistics,
which is heavily developed in evolutionary biology. It was first applied to
galaxies (Fraix-Burnet et al., 2006b, 2010) because they can be shown to
follow a transmission with modification process when they are transformed
through assembling, internal evolution, interaction, merger or stripping.
For each transformation event, stars, gas and dust are transmitted to the
new object with some modification of their properties. Cladistics has also
been applied to globular clusters (Fraix-Burnet et al., 2009), where
interactions and mergers are probably rare. These are thus simpler stellar
systems, even though we have firm evidence that internal evolution can
create another generation of stars and that globular clusters can lose mass.
Basically, the properties of a globular cluster strongly depend on the
environment in which it formed (chemical composition and dynamics), and also
on the internal evolution, which includes at least the aging of its stellar
populations. Since galaxies and globular clusters form in a strongly
evolving environment (Universe, dark matter haloes, galaxy clusters,
chemical and dynamical environment), the basic properties of different
objects are related to each other by some evolutionary pattern.



5 A more pertinent physical interpretation 

An obvious difficulty for a physicist in general is to interpret the results
of multivariate analyses using their models, which mostly result from a set of
equations and are more conveniently presented by curves on bivariate plots.
Interestingly enough, these models are multivariate, especially in
astrophysics, and the resolution of the set of equations yields a
"population" of possible results often called a grid of models. In practice,
some parameters are set to sensible values, and the corresponding models are
then compared to some observables. The set of observables may itself have
been reduced, by fixing some of them, in order to simplify the information.

It appears that we must here compare two populations, one of real objects
and one of models, in a multivariate space. We show here two examples
of multivariate (and evolutive) analyses of astrophysical objects showing that
such approaches are more direct, more objective and more physically
pertinent.

Figure 1 shows the cladogram obtained for globular clusters of our Galaxy
and the projection of the partitioning on pair plots for the four parameters
used for the analysis: logTe, which measures the temperature of stars that
are at a specific point in their evolution; Fe/H; MV, which is the total
visible intensity (magnitude) and roughly indicates the mass of the globular
cluster; and Age, which can be measured quite precisely because all stars of
a given globular cluster formed at nearly the same time. However, Age is not
an intrinsic property discriminating evolutionary groups since it evolves in
the same way for all. We nevertheless gave it a half weight in order to
arrange the objects within each group (Fraix-Burnet et al., 2009). Three
groups are identified. The first one (in blue) has on average the lowest
Fe/H ratio, which measures the proportion of heavy atomic elements that are
processed within stars. This group is consequently considered as more
primitive. It is obvious that this partitioning would be impossible to
obtain with only bivariate plots.

Figure 1: Top: cladogram obtained for the globular clusters of our Galaxy.
Bottom: projection of the partitioning on pair plots with the four parameters
used for the cladistic analysis. From Fraix-Burnet et al. (2009).

Figure 2: The fundamental plane of 699 galaxies, showing the partitioning
and projection of the tree obtained by a cladistic analysis
(Fraix-Burnet et al., 2010).

Looking at other parameters (such as orbital elements, kinematics, more
refined chemical abundances...) revealed clear characteristics that allowed
us to infer that each group formed during a particular stage of the assembly
history of our Galaxy. The blue group is the oldest one. It formed during
the dissipationless collapse of the protogalaxy, and its clusters are located
mainly in the outer halo. The red group belongs to the inner halo and the
corresponding clusters formed at a later stage during the dissipational phase
of Galactic collapse, which continued in the halo after the formation of the
thick disc and its globular clusters. These clusters were very massive before
"star evaporation" took place. The last group (green) formed during an
intermediate and relatively short period and comprises clusters of the disc
of our Galaxy (all details in Fraix-Burnet et al., 2009).

Figure 3: Cluster and cladistic analysis of the fundamental plane of
early-type galaxies: bivariate plots showing how correlations differ for each
group and for the whole sample. Panel axes include MRabs and log(Mdyn). Note
in particular the Mg2 vs log σ plot revealing a spurious correlation between
these two parameters (Fraix-Burnet et al., 2010).

Astronomical Data Analysis, 6th conf., 3-7 May 2010, Monastir, Tunisia

Another example is given by the fundamental plane of early-type galaxies,
which is a long-known correlation between the central velocity dispersion
σ, the surface brightness μe and the effective radius re. In addition,
the metallicity, as measured with the Mg2 index, plays a role and seems
to be correlated with log σ. We performed a K-means cluster analysis and
a cladistic analysis in parallel (Fraix-Burnet et al., 2010). The
partitionings are in remarkable agreement. We believe that this is due to
the careful choice of the parameters. For cladistics, they must be
informative with respect to diversification, and should not be redundant or
incompatible. This requirement is logically pertinent for any cluster
analysis as well.
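
For readers unfamiliar with K-means, the following is a minimal sketch of
the standard iteration (Lloyd's algorithm) on synthetic four-parameter data.
The data, the function names and the deterministic farthest-point
initialisation are assumptions made for this illustration. They are not the
settings of the analysis cited above, and in practice the parameters would
usually be standardised first.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns (centroids, labels)."""
    # Deterministic initialisation (an illustrative choice): start from the
    # first point, then repeatedly add the point farthest from all centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:               # partition unchanged: converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                        # keep the old centroid if empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


rng = random.Random(1)


def blob(center, n, spread=0.3):
    """Draw n synthetic points scattered around a hypothetical group centre."""
    return [tuple(c + rng.gauss(0.0, spread) for c in center) for _ in range(n)]


# Two well-separated synthetic groups in a four-parameter space.
points = blob((0.0, 0.0, 0.0, 0.0), 50) + blob((5.0, 5.0, 5.0, 5.0), 50)
centroids, labels = kmeans(points, k=2)
print(labels[:5], labels[-5:])
```

Unlike a cladistic analysis, this procedure returns only a partition: it
says nothing about how the groups relate to one another.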

In addition, cladistics provides the evolutionary relationships between
the groups. In Figure 2, the tree is projected onto the log σ vs μe diagram,
on which the fundamental plane is seen essentially face-on. Since galaxies
are more complicated than globular clusters, the interpretation of the results
and of all relations between all possible parameters and within each group
takes great advantage of numerical simulations. Here again, we are able to
derive the probable history of each group of galaxies as well as their
relative level of diversification, giving possible sequences of past
transforming events such as mergers, accretions or sweeping (for details, see
Fraix-Burnet et al., 2010).



A quite interesting finding is that most known correlations are different
or even absent when we consider groups individually (Figure 3). This
indicates that the groups have different evolutionary histories. Another
noticeable fact is that the well-known correlation between Mg2 and log σ is
in fact spurious, or historical. It is due to the fact that each parameter
changes with the level of diversification, as clearly shown by the placement
of the groups (see Figure 3).



6 Conclusion 

Undoubtedly, the study of galaxies now requires multivariate statistical
treatments. Evolution must also be taken into account, and the concept of
populations seems appropriate and points to the use of methodologies developed
elsewhere. Complexity, evolution and classification suggest studies similar
to those of phylogenetics. Astrocladistics has opened the pathway.